By Jongmin Lim BCB420

Pathway and Network Analysis


Intro

The purpose of this analysis is to identify the transcriptomic difference between iPSC and iPSC derived astrocyte. There are 5 patients of iPSC cell and corresponding iPSC derived astrocyte. After load raw data, data has been filtering based on feature a leat 1 read per million in n of the samples, n as the size of the smallest groups of replicates, which is 2 replicates. After filtering, trimm mean of m value, and log normalize the data, HUGO symboles mapped to the corresponding gene. Gene with NA as HUGO symboles removed. And later, all gene with no HUGO symboles is removed.

Limma-Trend and Negative Binomial Generalized LInear Models with Quasi-likelihood test from limma v3.40.6 and edgeR package v3.26.8 perfomred for differnetial gene expression analysis. 9977 genes met the thresdhol for statistical significance with P value less than 0.05. 8684 genes pass the correction. Differential genes, upregualted and downregualted, analyzed with thresholded overrepresentation analysis using g:profiler. The heatmap result shows the clusters of gene that differentiate between iPSC and iPSC derived astrocyte.

GSEA will be use as non-thresholded pathwasy analysis and use cytoscape with enrichment map pipiline to summarize tand visualize the results.

Install required packages

if (!requireNamespace("colorRamps", quietly=TRUE)) {
  install.packages("colorRamps")
}
if (!requireNamespace("doBy", quietly=TRUE)) {
  install.packages("doBy")
}
if (!requireNamespace("gprofiler2", quietly=TRUE)) {
  install.packages("gprofiler2")
}
if (!requireNamespace("colorRamps", quietly=TRUE)) {
  install.packages("colorRamps")
}
#install required R and bioconductor packages
tryCatch(expr = { library("RCurl")}, 
         error = function(e) {  install.packages("RCurl")}, 
         finally = library("RCurl"))

#use library
tryCatch(expr = { library("limma")}, 
         error = function(e) { source("https://bioconductor.org/biocLite.R")
           biocLite("limma")}, 
         finally = library("limma"))
tryCatch(expr = { library("Biobase")}, 
         error = function(e) { source("https://bioconductor.org/biocLite.R")
           biocLite("Biobase")}, 
         finally = library("Biobase"))
tryCatch(expr = { library("ggplot2")}, 
         error = function(e) { install.packages("ggplot2")}, 
         finally = library("ggplot2"))

#For creating json and communicating with cytoscape
tryCatch(expr = { library("httr")}, 
         error = function(e) { install.packages("httr")}, 
         finally = library("httr"))
tryCatch(expr = { library("RJSONIO")}, 
         error = function(e) { install.packages("RJSONIO")}, 
         finally = library("RJSONIO"))


Load library

#List of packages that need to be active for this project
library(GEOmetadb)
library(knitr)
library(tidyr)
library(biomaRt)
library(edgeR)
library(BiocGenerics)
library(ComplexHeatmap)
library(circlize)
library(colorRamps)
library(dplyr)
library(doBy)
library(gprofiler2)
library(kableExtra)
library(limma)


Background Information

This dataset is RNAseq expression of iPSC and iPSC-dervied Astrocyte. The purpose of this database is to validate identity of iPSC-derived astrocyte by looking at expression RBA. This is important since they need to get correct cell type to proceed to experiment.

Information about Platform GSE

#GEO description of of my dataset
gse <- getGEO("GSE116124", GSEMatrix=FALSE)
kable(data.frame(head(Meta(gse))), format = "html")
contact_address contact_city contact_country contact_email contact_institute contact_laboratory
Avinguda Gran Via, 199-203 Hospitalet de Llobregat (Barcelona) Spain Institute of Biomedicine of the University of Barcelona (IBUB) Neural Commitment and Differentiation Lab


More Information about dataset

current_gpl <- names(GPLList(gse))[1]
current_gpl_info <- Meta(getGEO(current_gpl))


Platform Title: Illumina HiSeq 2000 (Homo sapiens)
Original submission date: Nov 02 2010
Last update date: Mar 27 2019
Organism: Homo sapiens
No. of GEO datasets that use this technology: 7897
No. of GEO samples that use this technology: 122103


Downlaod and Organize

Download Data

  • Get the expression Data
#Asign the file
sfiles = getGEOSuppFiles('GSE116124')
fnames = rownames(sfiles)
cell_exp = read.delim(fnames[1], header=TRUE, check.names = FALSE)


  • Load first 15 row of dataset with html table format
#Load the dataset with html table format
kable(cell_exp[1:15,1:5], format = "html")
id_gene,gene_name,gene_type AD5607 AD5609 AD5610 AD5611
ENSG00000223972.5,DDX11L1,transcribed_unprocessed_pseudogene 0 0 0 0
ENSG00000227232.5,WASH7P,unprocessed_pseudogene 356 288 110 130
ENSG00000278267.1,MIR6859-1,miRNA 0 0 0 0
ENSG00000243485.5,MIR1302-2HG,lincRNA 4 6 0 0
ENSG00000284332.1,MIR1302-2,miRNA 0 0 0 0
ENSG00000237613.2,FAM138A,lincRNA 0 0 0 0
ENSG00000268020.3,AL627309.6,unprocessed_pseudogene 0 0 0 0
ENSG00000240361.2,OR4G11P,transcribed_unprocessed_pseudogene 0 2 0 0
ENSG00000186092.5,OR4F5,protein_coding 0 1 0 0
ENSG00000238009.6,AL627309.1,lincRNA 10 16 28 17
ENSG00000239945.1,AL627309.3,lincRNA 6 0 26 10
ENSG00000233750.3,CICP27,processed_pseudogene 117 45 39 63
ENSG00000268903.1,AL627309.7,processed_pseudogene 575 590 19 26
ENSG00000269981.1,AL627309.8,processed_pseudogene 107 127 0 9
ENSG00000239906.1,AL627309.2,antisense_RNA 1 8 0 0


Organize data

  • ID_gene,gene_name, and gene_type are in one colums.
  • Need to divide into first 3 columns
#Separate the first column in to 3 total based on the id, name,and type of gene
cell_exp <- separate(cell_exp, col = "id_gene,gene_name,gene_type", into = c("id_gene", "gene_name", "gene_type"), sep = "\\,")

#Assign the column names
colnames(cell_exp) <- list("ensembl_gene_id", "gene_name", "gene_type", 
                           "iPSC.A", "iPSC.B",    
                           "hASTRO.E", "hASTRO.E",
                           "hASTRO.A","hASTRO.B",
                           "hASTRO.C", "hASTRO.D",
                           "iPSC.C", "iPSC.D")
#Remove unessacry data for this comparison
cell_exp <- cell_exp[,-(6:7)]

#Need to remove decimal point so I can map identifiersd without version
cell_exp$ensembl_gene_id  <- gsub(pattern = "\\.\\d+$", replacement = "", x = cell_exp$ensembl_gene_id, ignore.case = TRUE)


Filtering

  • Need to filter out geens that have low counts
  • In edgeR, recommended to remove features wihtout at least 1 read per million
  • My sample has 5 groups and 2 replicates
#Change to count per million
cpms = cpm(cell_exp[,4:11])
#Set the rowname of first colum of cell_exp dataset
rownames(cpms) <- cell_exp[,1]
#Remove 1 or lower read per million.
#Keep the replicate group that equal or greater than 2
keep = rowSums(cpms >1) >=2
cell_exp_filtered = cell_exp[keep,]

Define the group

  • Cell type with corresponding GSM number can be find in this link and gse link
  • Divide into 2 groups since there is 2 replicates
  • Open the gse file directly and identify the cell type with correspoding sample code (AD####).
#Define the group
samples <- data.frame(
           lapply(colnames(cell_exp)[4:ncol(cell_exp)],
           FUN=function(x){
             unlist(strsplit(x, split = "\\."))[c(1,2)]})) #Separate based on "."
#Separate columns based on cell type and patient
colnames(samples) <- colnames(cell_exp)[4:ncol(cell_exp)]
rownames(samples) <- c("cell_type", "patient")
samples <- data.frame(t(samples))
samples
##          cell_type patient
## iPSC.A        iPSC       A
## iPSC.B        iPSC       B
## hASTRO.A    hASTRO       A
## hASTRO.B    hASTRO       B
## hASTRO.C    hASTRO       C
## hASTRO.D    hASTRO       D
## iPSC.C        iPSC       C
## iPSC.D        iPSC       D


Normalization

Applying TMM to dataset

  • Trimmed mean: average after remove upper and lower percentage of the data points
  • By default, 30% of the M values and 5% of the A values
  • TMM compare each sample to a reference
  • Data does not need to be modified prior to normalization
#Convert data.frame to matrix
#Need to make sure that use filtered count and matrix
filtered_data_matrix <- as.matrix(cell_exp_filtered[, 4:11])
rownames(filtered_data_matrix) <- cell_exp_filtered$ensembl_gene_id#Add rownames

#Define group as cell type to compare
tmMreq = DGEList(counts = filtered_data_matrix, group = samples$cell_type)

#Calculate normalization factors
tmMreq = calcNormFactors(tmMreq)

#Get normalized data
normalized_counts <- cpm(tmMreq)


Two boxplots

  • Comparison between normalized and not normalized data with box plots
  • There is some difference between two plots
  • Median of each samples are more equal to each other in normalized box plots.
  • I don’t think interquartile range changed
#Separate the area to load 2 graphs
par(mfrow= c(1,2))

#Not normalized boxplot
data2plot <- log2(cpm(cell_exp_filtered[,4:11]))
boxplot(data2plot, xlab = "Samples", ylab = "log2 CPM",
        las = 2, cex = 0.5, cex.lab = 0.5,
        cex.axis = 0.5, main = "Cell RNAseq Samples")
#draw the median of each box plot
abline(h = median(apply(data2plot, 2, median)), col = "green", lwd = 0.6, lty = "dashed")

#Normalized boxplot
data2plot <- log2(normalized_counts[,1:7])
boxplot(data2plot, xlab = "Samples", ylab = "log2 CPM",
        las = 2, cex = 0.5, cex.lab = 0.5,
        cex.axis = 0.5, main = "Normalized Cell RNAseq Samples")
#draw the median of each box plot
abline(h = median(apply(data2plot, 2, median)), col = "green", lwd = 0.6, lty = "dashed")


Two density graphs

  • Comparison between normalized and not normalized data with density graphs
  • There is some difference between two plots
  • Each line in normalized data are closer together compare to not normalized data except hole that located in around range 0 get bigger.
  • Location of lines in normalized data are different compare to not normalized data
#Separate the area to load 2 graphs
par(mfrow= c(1,2))

#Not normalized data (left)
counts_density <- apply(log2(cpm(cell_exp_filtered[, 4:11])), 2, density)
#Calculate the limits across all the samples
xlim <- 0; ylim <- 0
for (i in 1:length(counts_density)){
  xlim <- range(c(xlim, counts_density[[i]]$x));
  ylim <- range(c(ylim, counts_density[[i]]$y))
}
cols <- rainbow(length(counts_density))
ltys <- rep(1, length(counts_density))
#plot the first density plot to initialize the plot
plot(counts_density[[1]], xlim=xlim, ylim=ylim, type="n",
     ylab="Smoothing density of log2-CPM", main = "Cell RNAseq Samples", cex.lab = 0.85)
#plot each line
for (i in 1:length(counts_density)) lines(counts_density[[i]], col=cols[i], lty=ltys[i])
#create legend
legend("topright", colnames(data2plot),
       col=cols, lty=ltys, cex=0.75,
       border = "blue", text.col = "green4",
       merge = TRUE, bg = "gray90")

#Normalized data (right)
normalized_density <- apply(log2(normalized_counts[, 1:7]), 2, density)
#Calculate the limits across all the samples
xlim <- 0; ylim <- 0
for (i in 1:length(normalized_density)){
  xlim <- range(c(xlim, normalized_density[[i]]$x));
  ylim <- range(c(ylim, normalized_density[[i]]$y))
}
cols <- rainbow(length(normalized_density))
ltys <- rep(1, length(normalized_density))
#plot the first density plot ot initialize the plot
plot(normalized_density[[1]], xlim=xlim, ylim=ylim, type="n",
     ylab="Smoothing density of normalized log2-CPM", main = "Normalized Cell RNAseq Samples", cex.lab = 0.85)
#plot each line
for (i in 1:length(normalized_density)) lines(normalized_density[[i]], col=cols[i], lty=ltys[i])
#create legend
legend("topright", colnames(data2plot),
       col=cols, lty=ltys, cex=0.75,
       border = "blue", text.col = "green4",
       merge = TRUE, bg = "gray90")


MDS plot

  • MDS plot represent the distance between samples
  • Closer distence mean similar in RNAseq
  • iPSC cluster together. Even if they come from different patient, because of gaining pluripotency by force to express specific genes lead to increasing similarity
  • Astrocyte (hASTRO) are cluster in 2 but seperated. This might happen because one group should be patient with Parkinson disease while another group is without Parkinson disease
#MDS plot to see the distance between samples
plotMDS(tmMreq, labels=rownames(samples),
        main = "MDS Plot of Samples",
        col = c("darkgreen", "darkblue")[factor(samples$cell_type)])


Estimate common and tagwise dispersion

  • Dispersion describe how much variance deviated from teh mean
  • Specific to edgeR and used downstream when calculating differential expression
  • Can estimate common and tagwise disperison
  • Common dispersion calculate common disperions values
  • Tagwise dispersion calculate gene-specific disterpsion
#Estimate common and tagwise dispersion
model_design <- model.matrix(~ samples$patient) 
dispersion <- estimateDisp(tmMreq, model_design)


Plot BCV

  • Dispersion squared is biological coefficient of variation (BCV)
  • Dispersion is a measrue of vriation within samples
  • Each dot represent BCV for each gene
  • Red line represent common dispersion
  • Blue line represent the trend of dataset
  • All the Tagwise are within common and trend line, which is good
#Plot BCV with sample
plotBCV(dispersion, col.tagwise = "black", col.common = "red", main = "Biological Coefficient of Variation of Samples")


Plot mean variable relationship

  • Gray dots are raw varaince of the counts
  • Blue dots are estimated varaince by using tagwise dispersion
  • Red x are variance binned common dispersion
  • Dark red x are average of the raw varaince of each bin of genes
  • Blue line is mean variance related to negative binomial distribution with common dispersion
  • All the line, x, and dots are in same trend within NBline, which is good
  • I got information from this webpage link
#Separate the area to load 2 graphs
par(mfrow= c(1,2))
plotMeanVar(dispersion, 
            show.raw.vars = TRUE,                    
            show.tagwise.vars=TRUE,                  
            NBline=TRUE,                             
            show.ave.raw.vars = TRUE,                
            show.binned.common.disp.vars = TRUE,
            main = "Mean-Variance Relationship of Samples")


Map to HUGO symbols

Biomart

  • Important to working with up to date annotations and the right versions
#Connect to ensembl mart and limit to human datasets
#Using grch37 because I keep get error when using most updated ensembl
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl", host="grch37.ensembl.org")
## Found more than one class "textConnection" in cache; using the first, from namespace 'BiocGenerics'
## Also defined by 'RJSONIO'
## Found more than one class "textConnection" in cache; using the first, from namespace 'BiocGenerics'
## Also defined by 'RJSONIO'


Converting Human Ensembl Gene Ids to HGNC symbols

  • Attribute: Ensembl gene Ids and HGNC symbols
  • Filter: Ensembl gene ids
  • Values: Ensembl gene ids
#Filter
biomart_human_filters <- listFilters(ensembl) #Will go with ensembl_gene_id_version since my dataset contain version

#Attributes
kable(searchAttributes(mart = ensembl, 'hgnc') , format="html") #Will go with hgnc_symbol
name description page
62 hgnc_id HGNC ID feature_page
63 hgnc_symbol HGNC symbol feature_page
64 hgnc_trans_name HGNC transcript name ID feature_page


#Check to see if cell_id_conversion file exists
conversion_stash <- "cell_id_conversion.rds"
if(file.exists(conversion_stash)){
  cell_id_conversion <- readRDS(conversion_stash)
}else{
  cell_id_conversion <- getBM(attributes =
                              c("ensembl_gene_id", "hgnc_symbol"),
                              filters = c("ensembl_gene_id"),
                              values = factor(cell_exp_filtered$ensembl_gene_id),  #Values is first colum of my expression matrix
                              mart = ensembl)
  saveRDS(cell_id_conversion, conversion_stash)
}


Merge new identifier

#Merge the new identifier
normalized_counts_annot <- merge(cell_id_conversion, normalized_counts, by.x = 1, by.y = 0, all.y = TRUE)
kable(normalized_counts_annot[1:5,1:5],type = "html")
ensembl_gene_id hgnc_symbol iPSC.A iPSC.B hASTRO.A
ENSG00000000003 TSPAN6 49.9718769 49.3636305 47.307659
ENSG00000000419 DPM1 68.2095759 40.3461140 35.223430
ENSG00000000457 SCYL3 5.3520520 6.4867941 9.453908
ENSG00000000460 C1orf112 33.0721909 15.1552454 6.251778
ENSG00000000971 CFH 0.0290872 0.0290888 19.517745


Number of missing identifiers

#Number of identifier are missing
ensembl_id_missing_gene <- normalized_counts_annot$ensembl_gene_id[
  which(is.na(normalized_counts_annot$hgnc_symbol))]
length(ensembl_id_missing_gene)
## [1] 489
#Yes. (489/16993) = 0.029. About 2.9% of my dataset miss identifiers.


Table of gene that miss identifier as NA

  • Some of gene has NA or empty string for hgnc_symbol, which can be novel transcription or common gene. Therefore, I would like to keep these gene instead of remove the them
#Collect all the duplicate of hgnc_symbol
hugoDuplicated <- normalized_counts_annot[duplicated(normalized_counts_annot$hgnc_symbol), 1:2]

#Collect all the non empty string for hgnc_synbol
hugoEmptyDuplicated <- hugoDuplicated[!(nchar(hugoDuplicated$hgnc_symbol) == 0),]

#Collect all the non NA for hgnc_synbol
nohugoEmptyNADuplicated <- hugoEmptyDuplicated[!is.na(hugoEmptyDuplicated$hgnc_symbol),]

#Only remove gene with NA in hgnc_symbol
finalized_dataset <- normalized_counts_annot
head(finalized_dataset)
##   ensembl_gene_id hgnc_symbol      iPSC.A      iPSC.B  hASTRO.A  hASTRO.B
## 1 ENSG00000000003      TSPAN6 49.97187691 49.36363053 47.307659 36.538029
## 2 ENSG00000000419        DPM1 68.20957588 40.34611405 35.223430 32.577734
## 3 ENSG00000000457       SCYL3  5.35205201  6.48679411  9.453908  8.850399
## 4 ENSG00000000460    C1orf112 33.07219095 15.15524544  6.251778  7.955028
## 5 ENSG00000000971         CFH  0.02908724  0.02908876 19.517745 13.017319
## 6 ENSG00000001036       FUCA2 35.95182763 58.84656721 54.741175 57.407064
##    hASTRO.C  hASTRO.D     iPSC.C      iPSC.D
## 1 38.334551 48.529023 41.0380737 68.83296439
## 2 36.694033 31.360565 60.4500229 36.19971961
## 3 10.914466  9.546098  7.1278251  6.71686095
## 4  7.265151  8.965347 24.9322222 25.53001572
## 5 46.235820 34.699885  0.1213247  0.05944125
## 6 85.039090 88.056402 37.7319762 60.03565979

Differential Gene Expression

Apply New Normalization

#Normalize the data
finalized_dataset %>%
  mutate_at(vars(-ensembl_gene_id, -hgnc_symbol), funs(.+1)) %>%
  mutate_at(vars(-ensembl_gene_id, -hgnc_symbol), funs(log2(.))) -> new_normalized_count_data

#Remove any duplicate ensembl_gene_id
new_normalized_count_data <- new_normalized_count_data[!duplicated(new_normalized_count_data$ensembl_gene_id),]


#Check
kable(new_normalized_count_data[1:10,], type="html")%>%
  kable_styling(bootstrap_options = "striped", full_width = F)
ensembl_gene_id hgnc_symbol iPSC.A iPSC.B hASTRO.A hASTRO.B hASTRO.C hASTRO.D iPSC.C iPSC.D
ENSG00000000003 TSPAN6 5.6716296 5.6543104 5.594180 5.230281 5.297725 5.630202 5.3936247 6.1258363
ENSG00000000419 DPM1 6.1128998 5.3696798 5.178851 5.069433 5.236264 5.016165 5.9413416 5.2172198
ENSG00000000457 SCYL3 2.6672227 2.9043481 3.385970 3.300182 3.574642 3.398637 3.0228694 2.9480141
ENSG00000000460 C1orf112 5.0905228 4.0139308 2.858335 3.162698 3.047041 3.316920 4.6966739 4.7295536
ENSG00000000971 CFH 0.0413653 0.0413674 4.358800 3.809138 5.561809 5.157847 0.1652041 0.0833036
ENSG00000001036 FUCA2 5.2075738 5.9031966 5.800671 5.868071 6.426920 6.476647 5.2754532 5.9315805
ENSG00000001084 GCLC 6.0318414 5.2893284 4.361478 4.632048 4.755631 4.598725 5.7067888 5.6577783
ENSG00000001167 NFYA 5.7438846 5.3461439 5.036974 4.741960 5.222099 5.389108 5.8524971 5.7514902
ENSG00000001460 STPG1 2.9700256 3.4449558 3.563921 3.891790 3.691407 3.274257 3.3576937 3.8152756
ENSG00000001461 NIPAL3 3.5596589 3.7051759 6.746265 7.702369 5.860883 5.709398 3.7991991 3.9819321


Numerical matrix

#Numerical matrix for heatmap
heatmap_matrix <- new_normalized_count_data[,3:ncol(new_normalized_count_data)]

#Assign rowname by ensemble_gene_id
rownames(heatmap_matrix) <- make.names(new_normalized_count_data$ensembl_gene_id, unique=TRUE)

#Assign colnames by gene names
colnames(heatmap_matrix) <- colnames(new_normalized_count_data[,3:ncol(new_normalized_count_data)])


Linear model

  • Create a design matrix, which need to create linear model
#Model design based on samples cell type
model_design <- model.matrix(~samples$cell_type)

#Show the table
kable(model_design, type="html") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
(Intercept) samples$cell_typeiPSC
1 1
1 1
1 0
1 0
1 0
1 0
1 1
1 1


  • Create data matrix
  • Website that explain about assayData (n.d.)
#Create matrix that contain infomration of normalized data from column 3 to end
expressionMatrix <- as.matrix(new_normalized_count_data[,3:ncol(new_normalized_count_data)])

#Ensemble gene as row name
rownames(expressionMatrix) <- new_normalized_count_data$ensembl_gene_id

#Set column name based on normalized data
colnames(expressionMatrix) <- colnames(new_normalized_count_data)[3:ncol(new_normalized_count_data)]

#Create minimal set by using assayData, which must contain a matrix expression with rows and representing features and columnes representing samples
minimalSet <- ExpressionSet(assayData=expressionMatrix)

#Fit data to the above model
fit <- lmFit(minimalSet, model_design)


  • Apply empirical Bayes to compute differential expression for the above described model
  • The parameter trend=TRUE is specific to RNA-seq data
  • I use BH threshold to reduce the false discovery rate, which will avoid type 1 error. The other adjust method are too extreme that either increase false negative or false positive for this data. (n.d.)
#Apply empitical bayes to fit data
ebayes_fit <- eBayes(fit, trend=TRUE)

#Collect the data that adjust method, which is BH, applied
topfit <- topTable(ebayes_fit,
                   coef=ncol(model_design),
                   adjust.method = "BH",
                   number = nrow(expressionMatrix))

#Merge hgnc names to topfit table
output_hits <- merge(new_normalized_count_data[,1:2],
                     topfit,
                     by.y = 0, by.x = 1,
                     all.y = TRUE)
#Sort by pvalue
output_hits <- output_hits[order(output_hits$P.Value),]
#Table

kable(output_hits[1:10,], type="html") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
ensembl_gene_id hgnc_symbol logFC AveExpr t P.Value adj.P.Val B
11052 ENSG00000174099 MSRB3 -5.728253 4.526069 -31.58745 0 2.9e-06 13.74448
9937 ENSG00000167123 CERCAM -4.750546 5.418317 -29.91980 0 2.9e-06 13.39331
8635 ENSG00000159216 RUNX1 -6.816623 4.104574 -28.95348 0 2.9e-06 13.17504
4058 ENSG00000117228 GBP1 -5.720482 3.007139 -27.50517 0 2.9e-06 12.82560
13558 ENSG00000198796 ALPK2 -7.574717 4.591906 -27.04401 0 2.9e-06 12.70833
4224 ENSG00000119242 CCDC92 -5.380433 3.097919 -26.99704 0 2.9e-06 12.69621
9741 ENSG00000166147 FBN1 -6.899814 7.454708 -26.52532 0 2.9e-06 12.57275
9042 ENSG00000162849 KIF26B -5.353285 3.623001 -26.33549 0 2.9e-06 12.52213
3169 ENSG00000109099 PMP22 -7.085657 4.996938 -25.57108 0 3.2e-06 12.31267
7668 ENSG00000148516 ZEB1 -7.011617 3.941617 -25.26380 0 3.2e-06 12.22585


  • 10741 of genes pass the threshold p-value <0.05.
length(which(output_hits$P.Value < 0.05))
## [1] 10741


  • 9869 genes pass correction
length(which(output_hits$adj.P.Val < 0.05))
## [1] 9869


Multiple Hypothesis Test

Design matrix

  • When number of tests performed increases, a false positive might increase
  • Multiple hypothesis testing will come up for differential expression, pathways analysis, and for any analysis where there are multiple tests being performed
  • Control for family-wise error rate or for false discovery rate
#Create a design matrix
model_design_pat <- model.matrix(
  ~samples$patient + samples$cell_type) #Design matrix based on cell type and sample group
#Check the matrix
table_model_design_pat <- as.data.frame(model_design_pat)
table_model_design_pat
##   (Intercept) samples$patientB samples$patientC samples$patientD
## 1           1                0                0                0
## 2           1                1                0                0
## 3           1                0                0                0
## 4           1                1                0                0
## 5           1                0                1                0
## 6           1                0                0                1
## 7           1                0                1                0
## 8           1                0                0                1
##   samples$cell_typeiPSC
## 1                     1
## 2                     1
## 3                     0
## 4                     0
## 5                     0
## 6                     0
## 7                     1
## 8                     1


  • fit data to the model_design_cell
fit_pat <- lmFit(minimalSet, model_design_pat)


Differential Expression

  • Apply empirical Bayes to compute differential expression for the above described model
#The parameter trend = TRUE is specific to RND-seq data
ebayes_fit_pat <- eBayes(fit_pat, trend = TRUE)
topfit_pat <- topTable(ebayes_fit_pat,
                       coef = ncol(model_design_pat),
                       adjust.method = "BH",
                       number = nrow(expressionMatrix))

#Merge hgnc names to topfit table
output_hits_pat <- merge(new_normalized_count_data[, 1:2],
                         topfit_pat, by.y=0, by.x=1, all.y=TRUE)
#Sort by pvalue
output_hits_pat <- output_hits_pat[order(output_hits_pat$P.Value),]

#Check
kable(output_hits_pat[1:10,], type="html")%>%
  kable_styling(bootstrap_options = "striped", full_width = F)
ensembl_gene_id hgnc_symbol logFC AveExpr t P.Value adj.P.Val B
190 ENSG00000008441 NFIX -7.286118 4.519172 -38.40576 0e+00 0.0001382 9.189616
8635 ENSG00000159216 RUNX1 -6.816623 4.104574 -37.51832 0e+00 0.0001382 9.113968
4272 ENSG00000119681 LTBP2 -7.623419 6.060059 -37.04555 0e+00 0.0001382 9.072156
2506 ENSG00000103888 KIAA1199 -6.094511 6.036422 -36.64151 0e+00 0.0001382 9.035557
509 ENSG00000041982 TNC -6.931249 7.216754 -32.52452 1e-07 0.0002231 8.611453
9741 ENSG00000166147 FBN1 -6.899814 7.454708 -31.53934 1e-07 0.0002231 8.494229
399 ENSG00000026508 CD44 -5.801883 6.891218 -29.80103 1e-07 0.0002300 8.269947
11052 ENSG00000174099 MSRB3 -5.728253 4.526069 -29.45561 1e-07 0.0002300 8.222518
11294 ENSG00000176046 NUPR1 -6.668329 3.755843 -29.29102 1e-07 0.0002300 8.199566
9477 ENSG00000164694 FNDC1 -6.484058 4.635066 -27.55935 2e-07 0.0002938 7.943458


  • 10263 genes pass the threshold p-value < 0.05
  • 10263 significantly differential expressed
length(which(output_hits_pat$P.Value < 0.05))
## [1] 10263


  • 0 genes pass the threshold correction
length(which(output_hits_pat$adj.P.Value < 0.05))
## [1] 0


  • Use Quasi to calculate differential expression genes
#Set up edgeR objects
edge_obj = DGEList(counts=filtered_data_matrix, group=samples$cell_type)

#Estimate Dispersion - our model design
edge_obj <- estimateDisp(edge_obj, model_design_pat)

#Fit the model
fit <- glmQLFit(edge_obj, model_design_pat)


  • Calculate differential expression using the Quasi likelihood model
qlf.iPSC_vs_hASTRO <- glmQLFTest(fit, coef='samples$cell_typeiPSC')

#plot the table
topTags(qlf.iPSC_vs_hASTRO)
## Coefficient:  samples$cell_typeiPSC 
##                      logFC   logCPM         F       PValue          FDR
## ENSG00000147883  -6.782249 5.561304 1157.7616 8.891990e-08 0.0009823517
## ENSG00000176046  -8.428381 5.999789 1054.9682 1.156184e-07 0.0009823517
## ENSG00000119681  -7.686570 8.873231  762.0897 2.893571e-07 0.0012926821
## ENSG00000117226  -8.850286 4.040764  708.0475 3.559903e-07 0.0012926821
## ENSG00000174099  -5.977585 6.273533  682.7683 3.943740e-07 0.0012926821
## ENSG00000000971  -8.535409 3.672329  575.9320 6.366409e-07 0.0012926821
## ENSG00000164106 -10.275179 5.457191  557.4834 6.976773e-07 0.0012926821
## ENSG00000162849  -6.110159 5.183740  533.9295 7.876935e-07 0.0012926821
## ENSG00000159216  -7.965852 6.387569  533.5658 7.892034e-07 0.0012926821
## ENSG00000214336   8.112168 3.740764  530.2342 8.032190e-07 0.0012926821


  • Grab all the results
qlf_output_hits <- topTags(qlf.iPSC_vs_hASTRO, 
                           sort.by = "PValue",
                           n = nrow(new_normalized_count_data),
                           adjust.method = "BH")  #Use BH as adjust method for this assignment


  • 10217 genes pass the threshold p-value < 0.05, which indicates that 10127 genes are significantly differentially expressed
  • I use threshold p-value less than 0.05 because it is significant enough to get differently expressed genes
length(which(qlf_output_hits$table$PValue < 0.05))
## [1] 10217


  • 9042 genes pass the correction
length(which(qlf_output_hits$table$FDR < 0.05))
## [1] 9042


#Calculate normalization factors
edge_objs <- calcNormFactors(edge_obj)

#Dit model
fits <- glmQLFit(edge_objs, model_design_pat)

#calculate differential expression
qlf.hiPSC_vs_hASTRO <- glmQLFTest(fits, coef='samples$cell_typeiPSC')


  • Get all the result
qlf_output_hit <- topTags(qlf.hiPSC_vs_hASTRO,sort.by = "PValue",
                          n = nrow(new_normalized_count_data),
                          adjust.method = "BH")


  • 9977 genes pass the threshold p-value < 0.05
length(which(qlf_output_hit$table$PValue < 0.05))
## [1] 9977


  • 8684 genes pass correction
length(which(qlf_output_hit$table$FDR < 0.05))
## [1] 8684


  • Output top
topTags(qlf.hiPSC_vs_hASTRO)
## Coefficient:  samples$cell_typeiPSC 
##                      logFC   logCPM         F       PValue          FDR
## ENSG00000147883  -7.072359 5.561304 1356.5603 6.228246e-08 0.0006348592
## ENSG00000176046  -8.724601 5.999789 1271.3433 7.472008e-08 0.0006348592
## ENSG00000117226  -9.127679 4.040764  805.1695 2.688471e-07 0.0011341151
## ENSG00000119681  -7.979054 8.873231  729.4311 3.544762e-07 0.0011341151
## ENSG00000174099  -6.269017 6.273533  663.8840 4.612822e-07 0.0011341151
## ENSG00000000971  -8.831714 3.672329  651.8760 4.854351e-07 0.0011341151
## ENSG00000162849  -6.402484 5.183740  640.2579 5.104639e-07 0.0011341151
## ENSG00000159216  -8.260542 6.387569  610.8815 5.820706e-07 0.0011341151
## ENSG00000164106 -10.535707 5.457191  604.0468 6.006612e-07 0.0011341151
## ENSG00000103888  -6.315549 8.034202  554.7927 7.617441e-07 0.0011660431


  • 4930 genes are up regulated (downregulated when differentiate iPSC to astrocyte)
length(which(qlf_output_hit$table$PValue < 0.05 & qlf_output_hit$table$logFC > 0))
## [1] 4930


  • 5047 genes are down regulated (upregulated when differentiate iPSC to astrocyte)
length(which(qlf_output_hit$table$PValue < 0.05 & qlf_output_hit$table$logFC < 0))
## [1] 5047


#merge gene names with the top hits
qlf_output_hits_withgn <- merge(new_normalized_count_data[,1:2],qlf_output_hit, by.x=1, by.y = 0)
qlf_output_hits_withgn[,"rank"] <- -log(qlf_output_hits_withgn$PValue,base =10) * sign(qlf_output_hits_withgn$logFC)
qlf_output_hits_withgn <- qlf_output_hits_withgn[order(qlf_output_hits_withgn$rank),]

Volcano plot

  • Upregulated genes (red) are located mostly lower side of graph while upregulated genes (blue) are located upper side of graph.
  • Gene of interest (orange), which is part of upregulated gene, is located lower side of graph.
  • No differentailly expressed genes (grey) are located at middle, which indicates that iPSC and ipsc-derived astrocyte express these gene in similar level.
#P values from data that apply multiple hypothesis testing
qlf_cell_model_pvalues <- data.frame(ensembl_id = rownames(qlf_output_hits$table),
                                     qlf_cell_pvalue=qlf_output_hits$table$PValue)

#downregulated when differentiate iPSC to astrocyte
downregulated_gene <- qlf_output_hits_withgn[which(qlf_output_hits_withgn$PValue < 0.05 & qlf_output_hits_withgn$logFC > 0),]
upregulated_gene <- qlf_output_hits_withgn[which(qlf_output_hits_withgn$PValue < 0.05 & qlf_output_hits_withgn$logFC < 0),] 

#Plot for gene of interest CD44
ensembl_of_interest <- new_normalized_count_data$ensembl_gene_id[
                       which(new_normalized_count_data$hgnc_symbol == "CD44")]

#Downregulated gene ensembl_id 
downregulated_gene <-  data.frame(ensembl_id = downregulated_gene$ensembl_gene_id)

#Upregualted gene ensembl_id
upregulated_gene <-  data.frame(ensembl_id = upregulated_gene$ensembl_gene_id)

qlf_cell_model_pvalues$colour <- "grey" #Grey for all genes
qlf_cell_model_pvalues$colour[upregulated_gene$ensembl_id] <- "red" #Red for upregulated
qlf_cell_model_pvalues$colour[downregulated_gene$ensembl_id] <- "blue" #Blue for downregulated
qlf_cell_model_pvalues$colour[qlf_cell_model_pvalues$ensembl_id==ensembl_of_interest] <- "orange" #Orange for gene of interest

volcanoplot(ebayes_fit,
            coef = ncol(ebayes_fit),
            ylab = "M-ratio log expression",
            cex = ifelse(qlf_cell_model_pvalues$colour == "orange", 2, 0.3),
            col = qlf_cell_model_pvalues$colour,
            main = "Upregulated genes vs Downregulated genes")


Heat Map

  • Heat map of top hits using the Quasi likelihood model
  • Heat map is graph that translate numbers into scale of color
  • Good for summary of what are data looks like
  • Same cell_type cluster together in heat map as we can see clear block of gene expression between iPSC and astrocyte.
  • This can happen because author need to provide specific transcription factor to differentiate fibroblast to iPSC and iPSC to astrocyte. Since those specific transcription factors will significantly up or down regulated, the cluster would be more
#Set top hit
top_hits <- rownames(qlf_output_hits$table)[output_hits_pat$P.Value<0.05]
#Set heatmap matrix tophits
heatmap_matrix_tophits <- t(
  scale(t(heatmap_matrix[which(rownames(heatmap_matrix) %in% top_hits),])))

#Sorth the column by cell type
#Organize by cell type
heatmap_matrix_tophits <- heatmap_matrix_tophits[, c(grep(colnames(heatmap_matrix_tophits), pattern = "^iPSC"),
                                                     grep(colnames(heatmap_matrix_tophits), pattern = "^hASTRO"))]

if(min(heatmap_matrix_tophits) == 0){
    heatmap_col = colorRamp2(c( 0, max(heatmap_matrix_tophits)), #if no negative value in heatmap matrix, 
                             c( "white", "red"))                 #use white and red color
  } else {
    heatmap_col = colorRamp2(c(min(heatmap_matrix_tophits), 0, max(heatmap_matrix_tophits)), c("blue", "white", "red")) #blue, white, red if heatmap matrix contain negative value
  }
#Plot
current_heatmap <- Heatmap(as.matrix(heatmap_matrix_tophits),
                           cluster_rows = TRUE,
                           cluster_columns = TRUE,
                           show_row_dend = TRUE,
                           show_column_dend = TRUE, 
                           col = heatmap_col,
                           show_column_names = TRUE, 
                           show_row_names = FALSE,
                           show_heatmap_legend = TRUE,
                           column_title  = "Heatmap for top hits")
current_heatmap


Thresholded Over-Represenataion Analysis

downregulated_genes <- qlf_output_hits_withgn$hgnc_symbol[
  which(qlf_output_hits_withgn$PValue < 0.05            #downregulated when differentiate iPSC to astrocyte 
             & qlf_output_hits_withgn$logFC > 0)]
upregulated_genes <- qlf_output_hits_withgn$hgnc_symbol[
  which(qlf_output_hits_withgn$PValue < 0.05 
             & qlf_output_hits_withgn$logFC < 0)]      #upregulated when differentiate iPSC to astrocyte
write.table(x=downregulated_genes,
            file="cell_downregulated_genes.txt",sep = "\t",
            row.names = FALSE,col.names = FALSE,quote = FALSE)    #save as downregulated file 
write.table(x=upregulated_genes,
            file="cell_upregulated_genes.txt",sep = "\t",
            row.names = FALSE,col.names = FALSE,quote = FALSE)   #save as upregulated file 


#Check to make sure upregulated genes list contain the astrocyte specific gene
upregulated_genes_dataframe <- as.data.frame(upregulated_genes)
#List of gene that expressed in astrocyte
list_upregulated <- c(upregulated_genes_dataframe[upregulated_genes_dataframe == "CD44"],
                      upregulated_genes_dataframe[upregulated_genes_dataframe == "SOX9"],
                      upregulated_genes_dataframe[upregulated_genes_dataframe == "DIO2"])
#Remove NA
list_upregulated <- list_upregulated[!is.na(list_upregulated)]
list_upregulated
## [1] "CD44" "SOX9" "DIO2"


#Check to make sure upregulated genes list contain the astrocyte specific gene
downregulated_genes_dataframe <- as.data.frame(downregulated_genes)
#List of gene that expressed in iPSC
list_downregulated <- c(downregulated_genes_dataframe[downregulated_genes_dataframe == "MYC"],
                      downregulated_genes_dataframe[downregulated_genes_dataframe == "NANOG"],
                      downregulated_genes_dataframe[downregulated_genes_dataframe == "SALL4"])
#Remove NA
list_downregulated <- list_downregulated[!is.na(list_downregulated)]
list_downregulated
## [1] "MYC"   "NANOG" "SALL4"


Non-thresholded Gene set Enrichment Analysis

What method did I use?

I use Gene Set Enrichment Analysis (GSEA) becuase it is currently most used method for non-thresholded analysis. GSEA has Java GUI that serve as easy tool for researchers who want to do non-threshold analysis without writing down codes.The version of GSEA that I use is GSEA 4.0.3 Java GUI.

What genesets did I use?

I use “Human_GOBP_AllPathways_no_GO_iea_February_01_2020_symbol.gmt” because it includes human geneset of MsigBD, Institue of BIoinformatics, GO, Reactome, and Panther. link

Download the Genesets

# Download February GMT file from bader lab website

gmt_url = "http://download.baderlab.org/EM_Genesets/February_01_2020/Human/symbol/"

#list all the files on the server
filenames = getURL(gmt_url)
tc = textConnection(filenames)
contents = readLines(tc)
close(tc)

#get the gmt that has all the pathways and does not include terms inferred from electronic annotations(IEA)
#start with gmt file that has pathways only
rx = gregexpr("(?<=<a href=\")(.*.GOBP_AllPathways_no_GO_iea.*.)(.gmt)(?=\">)",
  contents, perl = TRUE)
gmt_file = unlist(regmatches(contents, rx))

working_dir <- "/home"
dest_gmt_file <- file.path(working_dir,paste(gmt_file, sep=""))

download.file(
    paste(gmt_url,gmt_file,sep=""),
    destfile=dest_gmt_file
)


Rank Non-Thresholded Genesets

#Compute ranks
qlf_output_hits_withgn[,"rank"] <- log(qlf_output_hits_withgn$PValue, base = 10)*
                                   sign(qlf_output_hits_withgn$logFC)

#Sort table by ranks
qlf_output_hits_withgn <- qlf_output_hits_withgn[order(qlf_output_hits_withgn$rank, decreasing = TRUE),]

#Remove any gene that don't have hgnc_symbol or label as NA
qlf_output_hits_withgns <- qlf_output_hits_withgn[!(nchar(qlf_output_hits_withgn$hgnc_symbol) == 0),]
qlf_output_hits_withgns <- qlf_output_hits_withgns[!is.na(qlf_output_hits_withgns$hgnc_symbol),]

#Create dataframe that only contain hgnc symbol and rank
qlf_output_hits_withgns <- qlf_output_hits_withgns[,c("hgnc_symbol", "rank")]

#Table
kable(qlf_output_hits_withgns [1:5,], type="html") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
hgnc_symbol rank
7603 CDKN2B 7.205634
11294 NUPR1 7.126563
4057 GBP3 6.570495
4272 LTBP2 6.450413
11052 MSRB3 6.336033


#Save as file for GSEA
write.table(qlf_output_hits_withgns,file="cell_expression.rnk",quote=F,sep="\t",row.names=F, col.names = F)


GSEA

Setting of GSEA


###Summary of Enrichment### This is the summary of enrichment that reached statistical significance in the p-value and FDR

Enrichment Results and Comparison to result of g:profiler

Set of upregulated pathways in differentiated Astrocyte


List Upregulated Pathways


Graph of First 6 Upregualted Pathways


As expected, pathways that related to characteristic of astrocyte detected as upregulated pathways. The first four upregulated pathways are: - Epithelial to mesenchymal transition - Elastic Fibre Formation - Molecules associated with elastic fibres - TNFA singlaing via NFKB

Epithelial to mesenchymal transition should be upregulated since iPSC need to go under differentiation to become astrocyte. Therefore, need to downregulated the epithelial characteristic (iPSC) and upregulate mesenchymal characteristic (astrocyte). This pathway commonly upregulated during embryogenesis and organ development (Kalluri and Weinberg 2009). Synthesis of elastic fibres and associated molecules, such as fibrillin-1 and fibrillin-2, are commonly produced in astrocyte since elastic fibers are one of major components of the extracellular matrix of the nerve head (Pena, Mello, and Hernandez 2000). TNF is cell singlaing protein involve in systemic inflammation and one of cytokines that make up the acute phase reaction. TNF produced in neuron and astrocyte. NFK is transcription regulator that regulate TNF. TNF must be regulated to prevent variety of human disease (Gahring et al. 1996). Most of the pathways that upregulated in GSEA are somewhat similar compare to upregulated pathways in g:profilers. However, some of pathways, such as UV response and positive cartilage regualtion, are not detect in g:profiler while detected in GSEA.

Set of Upregulated Pathways in Differentiated Astrocyte


List Downregulated Pathways


Graph of First 6 Downregulated Pathways


As expected, pathways that related to characteristic of iPSC detected as downregulated pathways The first four downregulated pathways are: - E2F upregulation - MYC uprgulation - DNA replication - DNA dependent DNA replication


E2F is transcription factors that regulate genes that important in cell profileration, specifically progression through G1 and into S-phase. E2F is known replication factors during embryogenesis, which indicates that should expressed in iPSC since iPSC is characteristically similar to embryonic stem cell (Myster, Bonnette, and Duronio 2000). MYC is common transcription factors that used to differentiate adult cell to iPSC (Takahashi et al. 2008). iPSC is the cell that well known for high replication rate because of its pluripotency and differentiation. To maintain pluripotencym, expression of DNA repair and replication must be maintained (Liu et al. 2017). Most of the pathways that down regulated in GSEA are similar to list of down regulated pathways in g:profiler and not detect any differences. This might happen because of unique characteristic of iPSC that is not detected in other type of cell.

Enrichment Analysis in Cytoscape

Initial Enrichment Map


  • There are 5633 nodes and 755 edge in initial enrichment map.
  • P-value and Q-value for node cutoff is 0.01 and 0.1 correspondingly.
  • The similarity coefficient for edge cut off is 0.375.
  • I choose 0.01 as P-value because I want to keep network more rigid.
  • I choose 0.1 as Q-value because as lower the Q-value, vanish many red nodes (upregulated pathways). I want to keep as many red nodes as possible since there are few red nodes comapre to blue nodes (downregulated pathwyas).
  • I choose default similrity coefficient for edge cutoff (0.375) because as increase similarity coefiicient value, significant amount of edge removed. Eventhough high threshold make more rigid formation of edge, I want to keep many edges as possible.

Comparison between result of g:profielr and GSEA is not straight forward because of way that g;profiler and GSEA process to get result is different. g;profiler use thresholded list to ask any gene sets that are enriched or depleted in list. It uses fisher exact test and hypergeometri test to assess. However, GSEA uses non-thresholded list to ask any gene sets that are ranked high or low in list. It uses KS test and linear model.

Annotate network

Autoannotated Parameters


  • MCL Cluster algorithm uses for clustserMaker2.
  • Similarity_coefficient from enrichment map used for edge weigth column.
  • Create singleton cluster and prevent cluster overlap since many genes involve in many pathays.
  • Attribute name created by using with GS_DESCR.
  • Label the cluster created by using WordCloud:Adjacent Words (default).
  • Set 3 as max words per label and 8 as adjacent word bonus.
  • Max and min number of words per cloud are 250 and 1 correspondingly.
  • Word Aggregation cut off is 1.

Annotated Network

Figure 1. Enrichment map after autoannotation

Upregulated Subnetwork

Figure 2. Subnetwork of upregulated pathways. It divide into 6. A is nodes that related to extracellular matrix, B is node about ebola virus, C is nodes related to regulation, D is nodes related to immunity, E is nodes related to system within cells, and F is node related to junction. Yellow nodes as selected nodes, first four pathways on upreguatled pahtways.
According to original paper, author said that astrocyte should have characteristic of capacity to produce ATP and propagate inercellular Ca2+ (Domenico et al. 2019). I cannot find any pathway that related to produce ATP in subnetwork.This might happen because the pathway that related to capacity to produce ATP is not significant enough. However, I dectect pahtway that propagate intercellular Ca2+ waves, which is platelet cytosolic ca2+ in section E of figure 2. Therefore, enrichment result partly support conclusions discussed in the original paper. There are similarity between erichment result from GSEA and g;profiler result as both shows junction pathways and extracellular matrix pathways as major pathways that expressed in astrocyte. However, unlike enrichment result from GSEA, g:profiler does not shows immunity pathways, ebola virus pathways, and demannosylation pathways as upregulated pathways in astrocyte. This might happen do to different starting data (ranked geneset for GSEA and list of up and down regulated geneset for g:profiler) and different methods for calculation.

There are many types of nodes, but most of nodes are related to either extracellular matrix and immunity.

Downregulated Subnetwork

Figure 3. Subnetwork of downregulated pathways. It divide into 5 sections. A is nodes that related to DNA manipulation. B is nodes that related to export riboculeoprotein, C is nodes that related to g2m checkpoint, D is nodes that related to myc transcription factors expression, and E related to RNA expression and manipulation. Yellow nodes as selected nodes, first four pathways on downregulated pahtways.
There are many similarity between result from g:profiler and GSEA. For example, both results show DNA replicaiton, recombination, repair, ard RNA processing as most common downregulated pathways. However, g:profiler did not shows specific pathways, such as pathways related to fanconi anemia, which related to DNA repair. Since characteristic of iPSC is unique, unlike upregulated pathways in astrocyte, downregulated pathways, which is pathways related to iPSC, are mostly similar even with different methods.

Collapsed Network

Collapsed Network

Figure 4. Erichment map after collapse.

Collapsed subnetwork

Figure 5. subnetwork of collapse network.
The theme of down regulated network (left side, blue) are cell cycle, specifically regulation of cell cycle. The theme of upreguatled network are production extracellular matrix. Original paper did not mention about extracellular matrix while mention that astrocyte should have high expression that related to intercellular Ca2+ waves. There is one collapsed node that related to intercellular ca2+ wave (platelet cytocolic ca2+) (Domenico et al. 2019). Therefore, enrichment results not able to fully support mechanism discussed in the original paper. However, from otehr papers, the production of extracellular matrix is one of characteristic of astrocyte that does not present in iPSC since astrocyte release wide range of extracellular matrix molecule that is important for glial development. Ebola virus pathways is the novel pathway detected in collapsed upregulated network (Domenico et al. 2019). I cannot find any papers that shows relationship between astrocyte and ebola virus. This pathways might assigned as upregulated pathways because author from original paper use virus to transfer specific genetic material to iPSC to differentiate into astrocyte. Downregualted pathways are the pathways that commonly foudn in iPSC, cell cycle (Domenico et al. 2019). Also, c-myc with other trasncription factors used to differentiate fibroblast to iPSC, which shows as downregulated pathways (Takahashi et al. 2008). These conclude that author in original paper successfully differentiate iPSC to astrocyte (Wiese, Karus, and Faissner 2012).

##Post Analysis ##

Post Analysis

Figure 6. Erichment map with signature analysis.

Part of Post Analysis

Figure 7. Part of erichment map with signature analysis.

I choose ATF3 as transcription factors for post analysis to my main network because ATF3 is driver of astrocyte differentiation from nerual precursor cells. ATF3 and NFIA transcription factors bind to specific binding site that activate eA genes and IA genes stage of early astrogligenesis. Then, RUNx2 transcription factor can bind and prevent eA genes to become mature astrocyte. I add signature gene sets with parameter of Mann Whitney (one-sided greater) with cutoff of 0.05. I use Mann Whiteney one sided greater to look at the relationship of ATF3 transcription factors to one of upregulated pathways. I detect highly engaged in apoptosis pathways, negative epithelial prolierataion pathway, tnfa signalling pathway, epithelial mesenchymal transition pathway, and hypoxia pathways. ATF3 might part of pathways that cause epithelial to mesenchymal transition pathways since ATF3 is require to drive astrocyte differentiation, but not maturation to astrocyte. ATF3 is also weakly related to downregulated pathways, such as metaphase sister chromatid and rna splicing, because at early stage, cell still has little bit stem cell characteristic and need to keep ability to high cell division rate(Tiwari et al. 2018).


References

Domenico, Angelique di, Giulia Carola, Carles Calatayud, Meritxell Pons-Espinal, Juan Pablo Muñoz, Yvonne Richaud-Patin, Irene Fernandez-Carasa, et al. 2019. “Patient-Specific iPSC-Derived Astrocytes Contribute to Non-Cell-Autonomous Neurodegeneration in Parkinson’s Disease.” Stem Cell Reports 12 (2). Elsevier: 213–29.

Gahring, Lorise C, Noel G Carlson, Rachel A Kulmer, and Scott W Rogers. 1996. “Neuronal Expression of Tumor Necrosis Factor Alpha in the Ovouji Fme Brain.” Neuroimmunomodulation 3 (5). Karger Publishers: 289–303.

Kalluri, Raghu, and Robert A Weinberg. 2009. “The Basics of Epithelial-Mesenchymal Transition.” The Journal of Clinical Investigation 119 (6). Am Soc Clin Investig: 1420–8.

Liu, Kai, Jian Mao, Lipu Song, Anran Fan, Sheng Zhang, Jianyu Wang, Nana Fan, et al. 2017. “DNA Repair and Replication Links to Pluripotency and Differentiation Capacity of Pig iPS Cells.” PloS One 12 (3). Public Library of Science: e0173047.

Myster, Denise L, Peter C Bonnette, and Robert J Duronio. 2000. “A Role for the Dp Subunit of the E2f Transcription Factor in Axis Determination During Drosophila Oogenesis.” Development 127 (15). The Company of Biologists Ltd: 3249–61.

Pena, Janethe DO, Paulo AA Mello, and M Rosario Hernandez. 2000. “Synthesis of Elastic Microfibrillar Components Fibrillin-1 and Fibrillin-2 by Human Optic Nerve Head Astrocytes in Situ and in Vitro.” Experimental Eye Research 70 (5). Elsevier: 589–601.

Takahashi, Kazutoshi, Koji Tanabe, Mari Ohnuki, Megumi Narita, Tomoko Ichisaka, Kiichiro Tomoda, and Shinya Yamanaka. 2008. “Induction of Pluripotent Stem Cells from Adult Human Fibroblasts by Defined Factors.” Obstetrical & Gynecological Survey 63 (3). LWW: 153.

Tiwari, Neha, Abhijeet Pataskar, Sophie Péron, Sudhir Thakurela, Sanjeeb Kumar Sahu, Marı'a Figueres-Oñate, Nicolás Marichal, Laura López-Mascaraque, Vijay K Tiwari, and Benedikt Berninger. 2018. “Stage-Specific Transcription Factors Drive Astrogliogenesis by Remodeling Gene Regulatory Landscapes.” Cell Stem Cell 23 (4). Elsevier: 557–71.

Wiese, Stefan, Michael Karus, and Andreas Faissner. 2012. “Astrocytes as a Source for Extracellular Matrix Molecules and Cytokines.” Frontiers in Pharmacology 3. Frontiers: 120.